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ABSTRACT 

We present a methodology to discover outliers in catalogs of periodic light-curves. We use 
cross-correlation as measure of "similarity" between two individual light-curves and then 
classify light-curves with lowest average "similarity" as outliers. We performed the analysis 
on catalogs of variable stars of known type from the MACHO and OGLE projects and estab- 
lished that our method correctly identifies light-curves that do not belong to those catalogs as 
outliers. We show how our method can scale to large datasets that will be available in the near 
future such as those anticipated from Pan-STARRS and LSST. 

Key words: methods: data analysis, stars: variables: other, Cepheids, binaries: eclips- 
ing, catalogues, astronomical data bases: miscellaneous. 



1 INTRODUCTION 



One major byproduct of the completed MACHO and ongoing 
OGLE, EROS, and MOA microlensing surveys are catalogs of 
~ 10 variable stars generated from long temporal photometric 
monitoring of stars in selected fields of the Magellanic Clouds and 
the Galactic bulge iFerlet et alJll997t |Paczvriskill200ll) . Most of 
these are comprised of periodic variable sta rs, whose periods were 
estimated via various statistical techniques (Lomb 1976; Reimann 
1 1994b . and a smaller number are comprised of non-periodic vari- 
able stars. Periodic variable stars have been classified by eye, based 
primarily on the visual appearance of their light-curves folded with 
an estimated period, and their locations in the color-magnitude and 
period-luminosity diagra ms. Automatic proc edures are available 
using Fourier coefficients (Morgan et al. 1998) and neural networks 
( Belo kurov et^l2003LlEver & Blake 2005), and others are under 
development iWozniaket^lT2^0^r The reliability of type classifi- 
cation of light -curves with these a utomated techniques is estimated 
to be ~ 90% IWozniak et alj2002l) . 

A natural question that arises concerns the detection of out- 
liers in variable star catalogs, i.e., members whose light-curves de- 
viate at a prescribed statistical level from the rest. There could be 
several reasons for this: a poor or incorrect period caused by noisy 
photometric data, outright misclassification, or, perhaps rarely and 
more interestingly, an intrinsic physical difference such as a slowly 
changing period or brightness amplitude which introduces noise in 
the folded light-curve, analo gous to the longer term variability o f 
the Cepheid variable Polaris jEvans et all2002tlEnele et allgOoj) 
or ap sidal motion in eccentric eclipsing binaries ^Worfet^l200lL 
2004). While catalog membership is nearly complete for variable 
stars derived from the MACHO and OGLE projects, the growth of 
massive databases of variable stars at fainter magnitudes is antici- 
pated (Paczynski 2001) , largely using automate d procedures in tan- 
dem with data-mining ( Belok urov et alj|2003l) . This circumstance 



recommends the development of a fast, reliable procedure to elimi- 
nate contaminating outliers, so they may be subject to later review, 
analysis, and reclassification. Developing such a procedure to find 
outliers in large datasets of variable stars provides the motivation 
for the methodology described in this paper. 

This paper is organized in the following way. Section 2 is de- 
voted to the methodology. In Section 3 we show how our method 
can be extended to a large number of light-curves. In Section 4 we 
present the results from runs on MACHO and OGLE catalogs. Fu- 
ture work is presented in Section 5 and conclusions are in Section 6. 



2 METHODOLOGY 

Our main objective is to identify outliers in a dataset of variable 
stars. The basic procedure is conceptually straightforward; com- 
pare the light-curve(s) in the dataset with that of every other light- 
curve in the dataset, and see which light-curve(s) is least like all 
others. Closer scrutiny reveals some of the difficulties of this pro- 
cess. First, given the size of the datasets (~ 10 5 for existing 
datasets, growing to ~ 10 s in the near future) the comparison 
method(s) must be fast and scale favorably. Second, the size of the 
datasets also prohibits human supervision, so the methods must be 
automated and very robust. 

Finding an outlier requires two separate comparisons. The first 
comparison is between two individual light-curves to determine 
how similar, or dissimilar, they are to each other. This comparison 
will be described in Section l2~2l Once this comparison is done for 
every pair of light-curves in the dataset we form a similarity matrix 
(see Fig.0. Each row of the similarity matrix represents the simi- 
larity of a given light-curve to all other light-curves in the dataset. 
To determine which light-curve in the dataset is least like all oth- 
ers we compare the rows of the similarity matrix and determine 
which row has on average the smallest similarity with every other 
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Figure 1. Similarity matrix for 200 light-curves. Each point represents the 
square of the correlation between two light-curves. Bright points correspond 
to strongly correlated/anti-correlated pairs of light-curves. Dark points cor- 
respond to weakly correlated pairs of light-curves. Note that the indexing 
starts at the lower left hand corner with (1,1). 



light-curve. This second comparison is described in more detail in 
Sectionl2~3l 

We begin by describing the preprocessing of the light-curves, 
and then the actual comparison tests. 

2.1 Preprocessing 

There is no one-shot approach to preprocessing a dataset of light- 
curves. A smoothing technique will remove undesired noise but 
could also remove true features of the light-curve. An interpo- 
lation may generate a more natural looking light-curve but can 
also insert features that are not physical. Sophisticated signal 
processing methods can be used to determine the best smooth- 
ing/interpolation/designaling method, however this will only be 
true for a single light-curve. Since we are dealing with a large col- 
lection of light-curves and essentially we are looking for a few dif- 
ferent light-curves, using a universal preprocessing algorithm is not 
a sensible strategy. For these reasons, we have chosen a minimal 
preprocessing scheme; one that preserves the main light-curve fea- 
tures but does not allow obvious spikes to dominate the statistics. 1 
The steps that are described below in this section are the steps 
used for the analysis done in Section[4]on the MACHO and OGLE 
catalogs. We have however experimented with a number of differ- 
ent schemes and the resulting modules developed will be released 
as part of the software suite. We have concluded that while the 
comparisons between pairs of light-curves do depend on the choice 
of preprocessing scheme the measure of overall outlier does not 
closely depend on the choice of parameters used in the preprocess- 
ing or the preprocessing method (assuming we stay within reason- 
able limits). 

For any measure of similarity to be meaningful, the light- 
curves must be preprocessed to retain the true features of the data, 
while minimizing the effects of noise and spurious measurements. 
Currently our comparison methods require the values of the light- 
curves at predetermined, uniformly spaced, times. 2 Since we need 

1 Here statistics refers to the overall outlier measure which is described in 
Sectionl2~3l 

2 Our current FFT method requires measurements uniformly spaced in 
time. Additionally, any time domain comparison method would require 
knowing the measurements at predetermined times. 



the values of the light-curves at uniformly spaced intervals we need 
to interpolate the light-curves. All light-curves have spurious data 
due to noise and other effects, and many have spikes. 

Any interpolation method may be adversely affected by these 
spikes and high-frequency noise. For this reason we have built into 
our methodology a three step spike-removal/interpolation/data- 
smoothing process. We first perform a running average on the light- 
curve data (spike-removal), we then perform an interpolation to 
obtain the values of the light-curve at prescribed times. We then 
perform a smoothing process on the interpolated data. This smooth- 
ing process is a generalized Savitzky-Golay (SG) smoothing iGorrvl 
1990). 

Running average: Our running average scheme replaces the 
value of each data point by the average of the data points contained 
within a box centered on the data point. Since our data are not 
evenly spaced we weigh the influence any value can have on the 
running average by its distance to the box center. We use a Gaus- 
sian weight that depends on distance from the "current point" and 
has a standard deviation half the window size. The results of a run- 
ning average are somewhat dependent on the width of the running 
window size. Since we wish to remove spikes but not features, we 
determined that a width of 1% of the light-curve phase worked well. 
An extension to this method is to additionally weigh the values by 
the observational error using Gaussian weights. This modification 
turned out to be extremely useful in very large datasets where ob- 
servational errors cannot be accounted in the measure of correla- 
tion. This point will become clearer in the following sections. 

Interpolation: We use simple linear interpolation in order to 
produce uniformly spaced light-curve points. We have found that a 
linear interpolation, in combination with the spike-removal and the 
smoothing, described next, works well in practice. 

Smoothing: The post-interpolation smoothing method uses 
a generalized Savitzky-Golay method. Savitzky-Gola y is a well 
known and widely used smoothing method (Pres s et alll992 ). The 
method we employ is generalized because it does not truncate the 
endpoints of the dataset in the smoothing process. It does this by 
employing the Gram polynomials. A typical implementation of the 
SG smoothing algorithm is, in a sense, a running least squares fit 
to the data and requires solution of a matrix equation as we march 
through the data. Using the recursive properties of the Gram poly- 
nomials, as in iGorrvl I 1990), SG smoothing can be accomplished 
without the need to solve matrix equations. 

There are two adjustable parameters in our SG smoothing. 
The order of the polynomials, and the width of the smoothing win- 
dow. Since smoothing of the data is the principle objective of this 
procedure, we typically use third order polynomials, attempting to 
smooth out the higher order oscillations. The width of the smooth- 
ing window determines the range of influence a given point has 
over neighboring points (the larger the window, the more neigh- 
boring points affect the smoothed value of the current point). Not 
wanting to "smooth-out" any features we determined that a width 
of 4% of the period worked well. A revie w of the p roperties of 
Savitzky-Golay filters can be found in lLuo et alJ t2005). 

Fig. |H shows the modifications in a given to a folded light- 
curve as it is passed through the pre-processing steps described 
above. The points in the top panel shows the original light-curve. 
The solid line in the same panel shows the light-curve after the 
spike-removal is performed. The solid line in the second panel 
shows the final result after interpolation and smoothing. In the same 
panel the results after spike-removal are shown for comparison. 
Upon inspection of Fig. [2] one will notice that the differences be- 
tween the initial, pre spike-removal light-curve and final smoothed 
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Figure 2. A light-curve as it is passed through the pre-processing steps. The 
points in the top panel shows the original light-curve. The solid line in the 
same panel shows the light-curve after the spike-removal is performed. The 
solid line in the second panel show the final result after interpolation and 
smoothing. In the same panel the results after spike-removal are shown for 
comparison. 



light-curve is perhaps not as dramatic as could be achieved, or 
that more smoothing could have been accomplished in the spike- 
removal stage. While this is true we preferred to err on the side 
of caution, resisting the temptation to produce very smooth light- 
curves while being certain to preserve features within the light- 
curve. 

Note that at each preprocessing step we have estimated the 
errors using typical error propagations techniques (see Appendix 
|B]for details). Hence the final light-curve contains observational 
errors that are necessary for the next stage. 



2.2 Comparison of Light-Curves 

For most tests, a comparison of two light-curves is a point-by-point 
comparison of two time series. In this work we have concentrated 
on the use of the correlation between two light-curves as the mea- 
sure of their "similarity". There are many choices of measure of 
similarity and depending on the "features" of the light-curves some 
work better than others. Cross-correlation and chi-square tests are 
the simplest choices. One can show though, that the order of out- 
liers remains the same nonetheless. Future work will investigate 
different measures of similarity. 



2.2.1 Correlation Coefficient of two time series with 
measurement errors 

The uncertainties in the flux measurements of a typical light-curve 
can vary significantly. For this reason any analysis based on the 
flux must account for the variations in the uncertainties in the flux 
measurements. 

The goal is to derive a modified correlation coefficient r of 
two light-curves that incorporates the errors of the measurements. 
We begin by considering the "standard" correlation coefficient 



(without observational errors) of two times series y(n) and x(n) 
where n is the discrete time. For each measurement y(n),x(n) 
there are associated measurement errors o y {n) and o x (n). For the 
moment we assume the averages of y(n) and x(n) to be zero. 

We examine how well the data fit the line y = ax. Using a 
least square fit 



X 



\ [y(n) - ax(n)] 



(1) 



then by taking the derivative with respect to a we can show that the 
X 2 is a minimum when 

y(n) x(n) 

En X2 (») 

Performing a least squares fit on the inverse equation x = /3y, we 
can similarly show that 

J2 n y(n)x(n) 



(3) 



(4) 
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The correlation coefficient is defined as <Weissteinll 999): 



VEn f 2 (")E„ i2 ( n ) 

This is the correlation coefficient without observational errors. In 
the case of observational errors, fitting the linear equations y = ax, 
x = (3y using a \ 2 yields, 

\y(n) - ax(n)] 2 
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(5) 



Setting the derivative with respect to a equal to zero we can show 
that 

<« = =^ (6) 



J2 n y(n)x{n)/a 2 (n) 

and equivalently 

J2 n y(n)x(n)/al(n) 



En y 2 ( n )/ a U n ) 



(7) 



Using the above definition of the correlation coefficient we can 
show 



' En v( n ) x ( n )/^v( n ) E„ y(n)x(n)/a x (n) 
En x2 ( n )/ a U n ) E„ y 2 (n)/ag(n) 



(8) 



If the mean values of x and y are not zero we can extend the above 
analysis by using the following transformations, 

^ ^ — 

y'i -> v%- y- 

Substituting for the new variables in Eq.[8|we can show that 

r 2 = 

xy 



Ei [fa(™)-s)0K«)-*)/4(»)] EJ (H( " ) - s)(;r(n) -^ /CT - ( " ) ] 

^] Ti [( a; (n)-x)2/^(«)]X;„[^( rl )-^ 2 /' T J( rl )] 



(9) 



2.2.2 Cross correlation in Fourier space 

The comparison of two light-curves using the correlation coeffi- 
cient described above hinges on the choice of epoch. Since the 
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phase of the first signal can be arbitrarily chosen a comparison 
could yield a small r 2 even if two light-curves are alike. There- 
fore this arbitrary epoch has to be adjusted for all light-curves prior 
to any comparison. 

An obvious approach is to move the epoch of one of the two 
light-curves until a maximum r 2 is achieved. Though conceptually 
simple, this approach could be quite computationally costly as it 
would need to be calculated for every pair of light-curves. Fortu- 
nately, this can be performed quite economically in Fourier space 
using the convolution theorem. 

The correlation between light-curve x and light-curve y with 
time lag r is given by 

JV-l 

r l y ( T ) = X! x ( n )y( n - T ) > ( 10 ) 

n=0 

where n is the discrete time. According to the convolution theorem 
(see AppendixlA"! the correlation can be written as 

rly(r) = J r ~ 1 [X(y)y(yj\ (r) (11) 

where X(y) is the Fourier transform of x(n) and y(u) is the com- 
plex conjugate of the Fourier transform of y(n). Therefore one can 
find the maximum correlation by finding the maximum of the in- 
verse Fourier transform of the product of the Fourier transforms 
of the two light-curves. For fast Fourier transforms (FFT), each 
Fourier transform requires 2N log(iV) operations, where N is the 
number of observations. Thus for each pair of light-curves a total 
of 6iV log(iV) operations are required. This is to be compared to 
iV 2 operations required doing the analysis in regular space. 

The above equations can be extended to include measurement 
errors (Eq.[8j. 3 

rl y (j) = (12) 
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The top panel in Fig. [3] shows two light-curves with arbitrary 
epochs. The middle panel shows the square of the correlation as a 
function of the time lag, r 2 y (r). The maximum occurs at r ~ 0.3, 
calculated using Eq. 1121 The bottom panel shows the same light- 
curves after one of the light-curves is time-shifted by 0.3. 



2.3 Outlier Measure 

Once we have completed the comparisons of each pair of light- 
curves, thus populating the similarity matrix, we compare the rows 
of the similarity matrix to determine the outliers. For each line in 
the similarity matrix we compute the average of the correlations as 

y^tx 

where y runs over all light-curves in the set except for x and N-lc 
is the number of light-curves. 

For each light-curve we calculate the average of the correla- 
tions as above and then we rank this measure. The light-curves with 

3 Finding the correlation of Eq.l9lwill only require shifting the zeroth com- 
ponent of the Fourier transforms. 
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Figure 3. Top panel shows two similar light-curves with arbitrary epochs 
after being normalized and shifted to set the mode to be one. The middle 
panel shows the square of the correlation plotted as a function of the epoch. 
The maximum occurs at ~ 0.3. Finally the bottom panel shows the two 
light-curves after one of the two light-curves are time-shifted by ~ 0.3. 

the lowest correlation are classified as probable outliers and are fur- 
ther inspected. 

How many light-curves should be inspected? A natural choice 
is to set a threshold based on the actual value of the average cor- 
relation (1Z). For example we could set the threshold at 1Z = 0.3; 
thus any light-curve below that value should be examined. Yet this 
is not exactly what we looking for. Consider the following scenario: 
a catalog consists of light-curves which are all alike (e.g. a collec- 
tion of well separated eclipsing binary stars with circular orbits and 
components that are both O stars). The light-curves of this collec- 
tion will be naturally strongly correlated. If one of the objects in the 
catalog is a binary system with one of the stars being a B star then 
the correlation to the rest of the light-curves will be slightly lower 
but not low in absolute terms. Nevertheless that light-curve will 
have the lowest 1Z in the set thus should be flagged for additional 
inspection. The same may apply to a collection of light-curves that 
are classified together but their light-curves show weak correlation 
(this is an indication that the band in which the observations were 
made is not the primary manifestation of the physical classifica- 
tion) therefore a low correlation does not necessarily mean that the 
particular light-curve is an outlier. Hence, what really matters is the 
average correlation, 1Z, compared to the rest of IZ's in the set. 

One could calculate the expectation value and variance of the 
distribution of IZ's and determined which light-curves are at least 
2cr's away from the mean. This would have been a reasonable ap- 
proach assuming the underling distribution was Gaussian. Unfor- 
tunately this is not true in general. First consider the case that all 
pairs of light-curves have the same correlation, A. The probability 
density function (pdf) of the correlations of this set would be a bi- 
variate normal distribution (which at large N becomes a Gaussian). 
In reality our sets of light-curves do not all have the same correla- 
tions. For simplicity assume that the light-curves could be grouped 
in clusters with constant correlations. Then the resulting pdf will be 
a superposition of bivariate normal distributions each with different 
A. Therefore the final pdf is dataset dependent and may or may not 
resemble a Gaussian. For these reasons, the average of IZ's and its 
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variance has proven not to be a reliable approach. 4 Consequently 
we have concluded that simply selecting the light-curves with low- 
est average correlation (order of 5%) is the fastest and the most 
reliable approach. 



3 LARGE DATASETS 

The numerical method shown in Section [2] works well to identify 
outliers. This will be demonstrated in Section|4]where outliers are 
identified in real light-curve catalogs. 

For a dataset containing ~ 5000 light-curves the run time on 
a typical desktop (3 GHz Intel® Xeon™) is ~ 5 hours. 

The real advantage of a method like this would lie in the ability 
to perform the analysis on much larger data-sets. Unfortunately our 
method scales as JVl C where JVlc is the number of light-curves. 
Fig. [4] gives a graphical representation of the performance of our 
model. It shows running times, in seconds, as a function of JVlc 
in a log-log scale. Superimposed on this plot is a curve that is pro- 
portional to JV LC . For large JVlc we see our algorithm scales as 
7Vl C . Accordingly, for a dataset of ~ 10 5 light-curves the analysis 
will take about 50 days! 5 Consequently we must craft alternative, 
smarter algorithms to deal with larger datasets. 

In the following subsections we present alternative approaches 
to speed up the calculation, each one having advantages and disad- 
vantages. In section l3~T1 we show how, in the case of a simple cor- 
relation coefficient (without the observation errors), the analysis in 
discovering outliers can be reduced from JVl C operations to JVlc 
operations. In Section l3~3l we will show a simple approximation 
that can be applied to the extended correlation coefficient in Eq. I12I 
(including observational errors and allowing time lag to vary). 



3.1 Simple correlation coefficient 

The correlation coefficient between two light-curves i,j is given by 



(JV- 



(14) 



where JV is the number of observations. To identify outliers we 
calculated the average correlation of each light-curve with the rest 
of the set (see Sec. l2.3t . This average correlation is given by 



Ri 



JV LC - 1 

\ j 

JVlc - 1 I ^ 



(15) 



Et (m-fi) (fd(t)-fj) 

(JV - 1) ai aj 



where we sum over all j's and subtract 1 for the i = j case. 
Re-arranging the order of the sums we get 6 

4 We have tested the above empirically and we found that in most cases the 
resulting pdf's are not invariant. 

5 The program could be executed in parallel thus reducing the computa- 
tional time by a factor of n cpu (n cpu being the number of cpu's). However 
the datasets will soon grow to 10 6 thus requiring few thousands of cpu's in 
order to run the analysis in few days. 

6 Here we are making the assumption that all light-curves have same t's. 
This is not true in general but it is true after proper interpolation -something 
we performed in the preprocessing steps. 



Ri = 
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(fj(t)-fi) V (fj(t)-fj) 



We define a centroid light-curve as: 
F(t)- ' 



LC 2— 1 Oi 



JVlc 



and its average centroid light-curve 
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Substituting the definitions of F and F into Eq. ll6l we get 

JVlc Ht{m-h){F(t)-F) l_ 

1 JVlc-1 (JV-l)ffi 

Note that at the limit where JVlc S> 1, 
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Since F(t) and F need to be calculated only once, the number of 
operations necessary to find all Ri's is 0(JVlc + JV x JVlc) ~ 
0(N x JVlc) which is a significant improvement over the 0(N x 
JV^c) which was necessary before. 

This gain does not come without disadvantages. Firstly note 
that we can not apply the same transformation from "average-of- 
the-correlations" to "correlation-to-the-average" in the case of cor- 
relation coefficients using observational errors, since in this case 
the magnitudes and the errors are mixed. Nevertheless this is not a 
major disadvantage since the observational errors can be partially 
taken into account in the averaging/smoothing operations. The sec- 
ond major shortcoming is the fact that the time lag cannot be con- 
sidered as a free parameter. This is because the time lag depends 
on both light-curves thus F(t) cannot be calculated once for all 
light-curves. To circumvent this problem we need to find a priori 
an absolute phase for all light-curves. 

3.2 Universal phasing 

To do just that we have devised the following algorithm of adjusting 
the epoch of all light-curves using clustering methods. The basic 
concept is to find where the signal with the highest/lowest magni- 
tude dip occurs for each light-curve and set it to a particular phase 
by time-shifting the folded light-curve. Since the data are noisy it 
will not be practical to just finding the maximum/minimum value of 
the magnitude. On the contrary, we must find a statistical measure 
of the signal. 

Our method is based on a clustering technique that divides the 
data (here data refers to a single light-curve) into clusters (cluster 
here means a subset of observations within a light-curve) based on 
the magnitude and then finds the cluster with the maximum aver- 
age. 

To find the clusters we required that both the density within the 
clusters and the separation between clusters should be maximum. 
In other words we want the clusters to be as compact as possible 
and be as separated from other clusters as possible. 

We measure the cluster compactness or inter-cluster measure 
of two clusters as: 



Sin 



(21) 
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where CI and C2 denote the clusters, U 's are the times of observa- 
tions in the particular cluster and ici is the average time in cluster 
CI. We also define the intra-distance between the two clusters as 



\tci — tc2\ 



+ 



"CI N C2 

where Nci > s the number of points in cluster CI and 
1 



2 



(JV< 



CI 



1) ^ 

ueci 



(U - tci) 



(22) 
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We define the following measure which by minimizing gives 
us a measure of goodness of clustering, 

Winter 



s 



(24) 



The actual algorithm is described below: 



• For each light-curve we select the highest/lowest 10% magni- 
tude data points. 

• We divide the data in two clusters as t c \ £ {ti, t%, ■ ■ ■ ,t s } 
and t C 2 £ {t s +l, ts+2, ■ ■ ■ , ijv} where s is the index of the separa- 
tor. 

• For each s = {1 . . . N} we calculate the goodness of cluster- 
ing using Eq.[24] If S is minimum within the range 1 < s < N we 
keep the division of data into two clusters. We repeat this process 
in the sub-clusters until no more clustering is favorable 7 . 

• After the clustering is done we calculate the mean magnitude 
and mean time in each cluster. We select the cluster of the highest 
mean magnitude. 

• We translate time such as the mean time of the selected cluster 
is always at the same predefined time. 

By phasing every light-curve to a universal phase the method 
of "correlation-to-the-average" can be applied assuming that the 
observational errors are incorporated in the running average 
method. However the method is an approximation since it does 
not guarantee that the correlation between two light-curves is 
maximum. Nevertheless for most light-curves where a maxi- 
mum/minimum signal is well defined this method should give us 
very similar results to the full method. We have tested this method 
on two sets; 500 light-curves of OGLE Eclipsing Binary stars (EBs) 
and 1000 light-curves of OGLE RRLyrasstars. Fig. [5] and Fig. [6] 
show the runs on these two sets. In each figure we show a his- 
togram of the the rank differences between the full method and the 
approximation described in this section for the bottom 10% of the 
light-curves. EBs do have a much better defined minimum, so the 
approximation performs very well (most light-curves are ranked 
with ±10 of the original rank), whereas the case of RRLyraj's the 
approximation is not performing as well. 



3.3 Outlier analysis within subsets 

Another alternative approach which avoids the drawbacks of the 
method described above is based on a simple statistical argument. 
If a light-curve is an outlier in the whole set it will be an outlier in 
a large subset of the whole set. We could then in principle divide 
the whole set into large subsets and perform the analysis on each 



' Since the data are in a dimensional space it is guaranteed that points in 
the same cluster are sequential. Therefore a separation at a given iteration 
cannot alter the clustering measure of the previous iteration 



1000.0 



100.0 




1000 10000 



Figure 4. Time in seconds to complete the analysis as a function of number 
of light-curves on log-log scale. The points correspond to the actual compu- 
tational times while the solid line corresponds to the TV 2 relation. It is clear 
that for large N's the computational time scales as N 2 . 



subset. If the subsets are randomly selected and the number of light- 
curves is large enough the outlier measure from each subset can be 
put together and hence we can rank all light-curves as if they were 
in a single set. 

Since each subset must be a substantial fraction of the full set 
(^ 10%) the overall performance gain is about a factor of ten at 
best. In the case of large sets this method will not scale very favor- 
ably but it is an "exact" method and it is very easily parallelizable. 

We have applied this method to 16,020 of the RRLyraes from 
the MACHO survey (see Sec.|4j. 



4 RESULTS 

We tested the validity of our method on various periodic star cat- 



collaboration 


[Alcock et alj2000h 


lUdalski etal 


Il997l) 9 . 



Both the MACHO and the OGLE projects were microlens- 
ing surveys devoted to finding gravitational microlensing events in 
the halo of the Milky Way by background stars in the Large and 
Small Magellanic Cloud (LMC and SMC) and the bulge of the 
Milky Way. These surveys also produced large catalogs of variable 
st ars: details on the MACHO variable star research can b e found 
in lAlcock et ail jl995lll996hB.il 997 al) : lcook et alJ jl99l . OGLE 
vari able star catalogs f ound during part II of the project (OGLE- 
II), lUdalski et al. 1997) with accompanying papers, can be found 
on the group website ISoszvnski et alJ2003l|Udalski et alJl999alht 
IWvrzvkowski et all2003F ! 

The variable stars considered were Eclipsing Binar y stars 
(EBs) of which catalogs we re published by MACHO (Faccioli et al. 
I2005U Alcock et alll997bT) and OGLE IWvrzvkowski et alJ2003iT 
RRLy rae and Cepheid s from OGLE jUdalski et alJ 1 1999a: 
IWvrzvkowski et al]|2003l) and unpublished MACHO collections 
that were compiled at Lawrence Livermore National Laboratory 
(LLNL) by Kem Cook, Doug Welch and Gabe Prochter. These lists 
have been generated from the MACHO database by appropriate 
cuts in the period-luminosity diagram. This is only a first step in 



http://www.macho.mcmaster.ca/ 
http://sirius.astrouw.edu.pir ogle/ 
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Table 1. Main features of the catalogs used 



Type of Variable star 


Found by 


Number of Stars 


Cepheids 


MACHO 


3177+ 


Cepheids 


OGLE-n 


1329+ 


RRLyra 


MACHO 


16020+ 


RRLyra 


OGLE-n 


5327+ 


EBs 


MACHO 


6064++ 


EBs 


OGLE-n 


2580+ 



+ EBs from both LMC and SMC were included. 

+ Only light-curves with at least 100 observations were included. 



Figure 5. Histogram of the outlier measure difference between 
the full method and the approximate method for 500 EB's from 
OGLE EB catalog. Only the 10% with the lowest average cor- 
relation were used. 



4 - n 



20 40 60 

Outlier measure change 



80 



Figure 6. Histogram of the outlier measure difference be- 
tween the full method and the approximate method for 1000 
RRLyras from the OGLE RRLyra; catalog. Only the 10% 
with the lowest average correlation were used. 



producing a catalog and thus the resulting lists are expected to be 
contaminated. 

MACHO observations were taken in two non standard band 
passes: MACHO "blue", hereafter indicated as Vmacho, with a 
bandpass of 440-590nm and MACHO "red", hereafter indicated 
as -Rmacho. with a bandpass of 590-780nm; transformations to 
standard Johnson V and Cousins R bands are described in detail in 

10 



lAlcock et allK 

The average number of observations in both bands is several 
hundreds, with the center of the LMC being observed more fre- 
quently than the periphery. 

MACHO period s were found by applying the Supersmoother 
algorithm iReimanrl 1 1 994h . first published by Fried mar] ll984l) . 
The algorithm folds the light-curve around different trial periods 
and selects the one that gives the smoothest folded light-curve. Pe- 
riods were found for the red and the blue band separately, and usu- 

10 Transformation to standard magnitudes is given by (Kem Cook, private 
communication) : 

V = Vmacho + 24.22 - 0.1804(V M acho - Rmacho) 
R = Rmacho + 23.98 + 0.1825(Vmacho - -Rmacho) 



ally agree with each other to better than 1%. The algorithm may 
fail though, usually determining a period for one color band that 
is a multiple of the period found for the other band. In these cases 
the light-curve with the incorrect period will often be flagged as 
an outlier; hence the program can be useful in finding wrong peri- 
ods in a large data set of variable stars (see MACHO Cepheids and 
RRLyraes below). 

OGLE observations were taken in the B, V and I bands and 
reduced via Difference Image Analysis (DIA) (Z ebruri et all20oil) : 
a catal og of variable stars for the Magellanic Clouds was thus pro- 
duced JZebruh et ai]|200lh and from it a sample of 2580 EBs was 
selected (Wvrzvkowsk i et all2003h : we used only / band, DIA re- 
duced observations in our analysis, since the number of observa- 
tions in this band was much higher (on the order of ~ 200-300) 
compared to V and B. 

The main features of the MACHO and OGLE variable star 
datasets are summarized in TableQ 

Results of these runs are presented in the following way: For 
each of the collections listed in Table 1 there are three figures and 
one table. The first figure shows the histogram of the outlier mea- 
sure. The second figure shows the centroid light-curve as defined 
in Eq. 1171 The next 9-panel figure presents the lowest nine light- 
curves, i.e. our outliers. Each panel is labeled according to its po- 
sition in the figure; from Al to C3. Following that there is a table 
which summarizes the properties of these outliers including our in- 
terpretation. These interpretations were formed after further inves- 
tigation including cross correlation with other surveys, position in 
the HR diagram, spectral types where available, etc. 

Cepheids: Cepheids are periodic variables with periods ranging 
from about 1 day to about 50 days (with few extreme examples 
of 200 days) and which lie between the main sequence and the 
giant stars. . Detailed characteristics of their light-curves varied de- 
pending on the period (Hertzprung progression). More details about 
Cephe ids and other variable stars in general can be found in lPetitl 
Jl987l) and lSterken & Jaschekl Jl99rl> . 

The MACHO Cepheid dataset contains a small number of light- 
curves where the folded period is an integer multiple of the "cor- 
rect" period. This can be seen in Fig.|5jAl, A2, B3, C2, C3 . Also 
there is a second bump in the histogram of average correlations 
(Fig. at about 0.1. These light-curves are mostly light-curves 
folded with integer multiple period of the "true" period. Notwith- 
standing, the light-curve shown in A3 in the same figure is clearly 
an EB and not a Cepheid. Bl is evidently a periodic light-curve 
(apparent from the distinct pattern in the folded light-curve) but the 
shape in both R and V bands (only R shown) does not match that 
of a Cepheid (or all subtypes). Further investigation (e.g. spectral 
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type) is needed to determine the type of variable. Note our goal 
in this work is to identify the outliers and thus demonstrate that 
this method can lead us to the few interesting cases. It is not our 
intention to do an in depth investigation for each unidentified light- 
curve only to point out the obvious misclassification's and inter- 
esting cases. Light-curve shown in CI does not look periodic or 
variable for that matter in both bands thus we classify it as "likely 
not periodic" star. 

The OGLE Cepheid catalog lUdalski et ail ll999iA 
IWvrzvkowski et ail |2003) has few true outliers. Only three 
interesting cases did make it into our list (see Fie. ll2l and Fie. U0> . 
From the histogram in Fig. ESI we see that there is no second 
bump but three light-curves are clearly on the lowest bin. Al 
is vaguely a periodic light-curve but there are not enough data 
and they are too noisy. Even if we agree to the periodicity we 
find that the asymmetry is atypical of Cepheids of all types, with 
slow rise and fast decline. Similarly A2 exhibits a clear periodic 
signal but wrong asymmetry. Light-curve A3 is interesting. The 
overall shape, period and color are consistent with a Cepheid. The 
extra regularly spaced spikes are too regular in folded space to 
be ignored. The possibility to be an EB with Cepheid variable is 
highly unlikely since the periods are synchronized (1:5) which 
suggest some fundamental dynamical process. A more careful 
study is needed to understand the physical process underlying 
this light-curve. The rest of the outliers have much higher average 
outlier measure and they are only shown here for consistency (9 
light-curves per catalog). 

RRLyraes: RRLyraes come in many different types but most pre- 
dominately in two subclasses. The RRAB which is the majority of 
them and RRC. These pulsating stars have very well defined period 
(0.5-0.3 days). They are usually asymmetric however a subclass 
RRS does have a sinusoidal shape. It is usually hard to distinguish 
them from Cepheids just from the characteristics of the shape of 
the light-curves. Mor e details about RRLy raestars can be found in 

IPetitl l 19871) and lSterken & Jaschekl I1996B. 

The published OGLE catalog lUdalski et alJll999gJ) is "cleaner" 
(does not contain the wrong types or wrongly stated period of vari- 
ables) than the unpublished MACHO collection. This can be seen 
from Fig. 1131 and 1161 where it is clear that the correlation distri- 
bution of the MACHO dataset is centered closer to zero than the 
distribution of the OGLE catalog (this is due to contamination of 
the MACHO dataset with other variable stats). 

As in the case of Cepheids the RRLyrae MACHO dataset contains 
light-curves that are either folded using a multiple of the true period 
or folded simply with the wrong period in one of the two bands and 
thus appear to be outliers. Nevertheless some of the light-curves 
were most likely misclassified as RRLyraes. Light-curves A2 and 
A3 in Fig. 1 151 have periods of 0.98 and 0.53 days which are too 
large to belong to RRC group. Such periods can be from the RRAB 
group but the shape, amplitude and symmetry of the light-curves 
indicates a non periodic light-curve; hence we ruled them as pos- 
sibly misclassified. Al was identified as the outlier of greatest de- 
gree. However when we looked at the V-band light-curve it had the 
characteristics (period, amplitude etc) of a RRC. Light-curves Bl, 
B2 and CI were simply folded with the wrong period in the red 
band. Looking in the V-band the periods were more in accordance 
to RRC group and the shape, amplitude characteristics are in accor- 
dance with that. 

In the OGLE RRLyra; catalog we identified three light-curves that 
likely do not belong to this catalog. Light-curve Al ofFig.[T8ldoes 
not look periodic and the quoted period and amplitude do not cor- 



respond to a typical RRLyra;. Light-curve CI has quoted period of 
0.86 days and amplitude of < 0.1 in the I-band and hard to make 
out signal. C3 is a light-curve that has period of 0.55 days thus 
most likely belonging to RRAB group but the light-curve is very 
symmetric thus belonging to the RRC group. This is one of the 
light-curves on which further investigation should be performed. 

EBs: Eclipsing Binary stars are not due to physical variation but 
rather due to occultation: one member of the pair of stars passes in 
front of the other. 

MAC HO EB cata logs are submitted for publication in 
iFaccioli et al] J2005). We used the method presented in this 
paper to help free the submitted catalogs from outliers. We found 
few cases of outliers that are shown here but will not be in the final 
published catalogs. These are the light-curves shown in Fig. l2U Al. 
A2, and A3 where all three light-curves have a symmetric single 
occultation and periods consistent more with RRLyraes rather 
than EBs. Light-curve in Bl shows no periodicity however after 
examining the V-band we were convinced that it is a true EBs. The 
Light-curve shown in C3 shows a very noisy light-curve but after 
cross correlating with the OGLE catalog we established that is a 
proper EBs. 

In the OGLE EBs catalog most outliers are EBs with very eccen- 
tric orbits thus appear as outliers since the second minimum will 
rarely be aligned with the second minimum of the rest of the light- 
curves. However light-curve shown in panel C2 is not a typical EB. 
There is either a 3rd body present in the system producing a sec- 
ond occultation or some form of atmospheric variation in one of 
the stars is synchronized with the binary system. Perhaps there is 
a large reflection effect. This occurs when the side of the dimmer 
star that is facing the Earth is illuminated by the brighte r companion 
star th us increasing the luminosity of the system IPollacco & Belli 
Il993t) . This effect also includes radiative brightening. For example 
the system could be a small hot star with a much cooler sub-giant 
or giant component. This light-curve warrants further investigation. 

The reason why the algorithm identifies highly eccentric EBs 
as outliers is well understood. At the same time it is well under- 
stood that this is an indication that cross-correlation may not be 
the best choice of similarity measure. In cases like these a different 
measure of similarity must be employed. These and other potential 
extensions will be investigated in future works. 



Finding outlier light-curves in catalogs of periodic variable stars 




Figure 7. Histogram of the outlier measure for 3297 Cepheids Figure 8 Centroid light-curve for 3297 Cepheids in the MA- 

in the MACHO sample. CHO sample 
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Figure 9. Light-curves with the lowest measure of similarity from the MACHO Cepheid dataset. Only the RED band is shown. 
Table 2. MACHO Cepheids outliers 



Survey 


Type 


ID 


Plot COORD 


Period [days] 


Days of Obs 


Num Obs 


Interpretation 


MACHO 


Ceph 


14.9223.221 


Al 


242.49801 


2721.04 


883 


Multiple period. 


MACHO 


Ceph 


9.4511.14 


A2 


12.19121 


2720.82 


595 


Multiple period. 


MACHO 


Ceph 


4.7459.14 


A3 


6.85445 


2718.04 


278 


EB 


MACHO 


Ceph 


60.7467.9 


Bl 


2.00174 


2717.75 


273 


Periodic but unlikely 
to be Cepheid 


MACHO 


Ceph 


81.9490.26 


B2 


1.14535 


2715.86 


204 


Blue band suggests an EB 


MACHO 


Ceph 


61.8562.27 


B3 


7.33385 


2715.84 


366 


Multiple period. 


MACHO 


Ceph 


20.4309.2977 


CI 


0.70794 


2715.71 


241 


Not periodic/variable. 


MACHO 


Ceph 


77.7067.41 


C2 


8.00520 


2709.83 


1333 


Multiple period. 


MACHO 


Ceph 


79.4659.3452 


C3 


6.96615 


2708.91 


1352 


Multiple period. 
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Figure 10. Histogram of the outlier measure for 3297 
Cepheids in the OGLE sample. 
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Figure 11. Histogram of the outlier measure for 3297 
Cepheids in the OGLE sample. 
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Figure 12. Light-curves with the lowest measure of similarity from the OGLE Cepheid catalog. Only the OGLE I band is shown. 



Table 3. OGLE Cepheids outliers 



Survey 


Type 


Field 


Number 


Plot COORD 


Period [days] 


Days of Obs 


Num Obs 


Interpretation 


OGLE-II 


Ceph 


17 


70123 


Al 


28.96683 


1419 


105 


Not enough/noisy data. 


OGLE-II 


Ceph 


13 


184117 


A2 


13.64083 


1419 


245 


Atypical asymmetry. 


OGLE-II 


Ceph 


21 


40876 


A3 


4.97338 


1418 


248 


Needs further study. 


OGLE-II 


Ceph 


17 


221134 


Bl 


11.22865 


1418 


243 




OGLE-II 


Ceph 


21 


119037 


B2 


0.87813 


1417 


264 




OGLE-II 


Ceph 


14 


114046 


B3 


0.9094 


1415 


238 




OGLE-II 


Ceph 


18 


185847 


CI 


12.20018 


1414 


244 




OGLE-II 


Ceph 


4 


168269 


C2 


0.72923 


1417 


327 




OGLE-II 


Ceph 


4 


427313 


C3 


0.67413 


1414 


454 
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Figure 13. Histogram of the outlier measure for 16080 RRL 
in the MACHO sample. 
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Figure 14. Centroid light-curve for 16080 RRL in the MA- 
CHO sample. 




Figure 15. Light-curves with the lowest measure of similarity from the MACHO RRL dataset. Only the R band is shown. 



Table 4. MACHO RRL outliers 



Survey Type 



ID 



Plot COORD Period [days] Days of Obs Num Obs Interpretation 



MACHO RRL 48.2992.463 



MACHO 
MACHO 
MACHO 



MACHO 
MACHO 

MACHO 
MACHO 



RRL 
RRL 
RRL 



82.8772.705 
73.13488.41 
76.10942.176 



MACHO RRL 79.5499.2627 



RRL 
RRL 



RRL 
RRL 



67.10489.79 
34.9080.261 

37.6316.471 
49.6623.336 



Al 

A2 
A3 
Bl 

B2 

B3 
CI 

C2 
C3 



0.11828 

0.98011 
0.53464 
0.46759 

0.25465 

0.29736 
0.30795 

0.62069 
0.62001 



2720.99 

2717.79 
2716.77 
2714.75 

2710.73 

2707.9 
2700.79 

2697.96 
2689.9 



151 

747 
121 
121 

1387 

273 
156 

125 
178 



RRC.Incorrect period in R-band. 
P v = 0.35484. 
Unlike RRL. 
Unlike RRL. 

RRC. Incorrect period in R-band. 
P v = 0.35042. 

RRC. Incorrect period in R-band. 
P v = 0.33756. 

RRC. Incorrect period in R-band. 
Pv = 0.46193. 
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Figure 16. Histogram of the outlier measure for RRL in the 
OGLE sample. 



Figure 17. Centroid light-curve for 5327 RRLs in the OGLE 
sample. 




Figure 18. Light-curves with the lowest measure of similarity from the OGLE RRL catalog. Only the OGLE I band is shown. 



Table 5. OGLE RRL outliers 



Survey Type 



ID(RA-DEC) 



Plot COORD Period [days] Days of Obs Num Obs Interpretation 



OGLE RRL 053803.42-695656.4 Al 0.3323824 1420 267 

OGLE RRL 053325.94-701109.8 A2 0.2876012 1420 371 

OGLE RRL 052447.86-694319.0 A3 0.2585634 1420 495 

OGLE RRL 052436.03-694541.8 Bl 0.2232339 1420 504 

OGLE RRL 053525.67-702210.2 B2 0.2164212 1420 298 

OGLE RRL 054036.89-701424.8 B3 0.2361195 1420 268 

OGLE RRL 052219.98-691907.1 CI 0.8616601 1420 503 

OGLE RRL 053241.91-702718.9 C2 0.2749622 1420 373 

OGLE RRL 054609.21-702316.7 C3 0.5494448 1420 263 



Unknown; Noisy data. 



Unknown. 

RRAB period but amplitude too small. 
Unknown. 

RRAB period but symmetric. 
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Figure 21. Light-curves with the lowest measure of similarity from the MACHO EB catalog. Only the MACHO R band is shown. 
Table 6. MACHO EB outliers 



Survey 


Type 


ID 


Plot COORD 


Period [days] 


Days of Obs 


Num Obs 


Interpretation 


MACHO 


EB 


64.7964.375 


Al 


0.32279 


2728.73 


263 


RRAB. 


MACHO 


EB 


68.10485.363 


A2 


0.36804 


2714.83 


205 


Asymmetry, period. 
RRAB. 


MACHO 


EB 


27.10782.248 


A3 


0.28717 


2713.96 


294 


Asymmetry, period. 
RRAB 


MACHO 


EB 


212.15797.121 


Bl 


0.67719 


2711.93 


910 


Asymmetry, period. 

Red band is noisy. Blue band is OK. 


MACHO 


EB 


25.3836.269 


B2 


2.27436 


2711.82 


341 




MACHO 


EB 


36.7395.92 


B3 


0.31633 


2702.76 


276 




MACHO 


EB 


22.4871.431 


CI 


3.04861 


2702.73 


530 




MACHO 


EB 


57.4953.114 


C2 


1.25421 


2660.68 


278 




MACHO 


EB 


80.7194.423 


C3 


2.60078 


2649.06 


1370 
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Figure 22. Histogram of the outlier measure for 2580 EBs in 
the OGLE sample. 



Figure 23. Centroid light-curve for 2580 EBs in the OGLE 
sample. 
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Figure 24. Light-curves with the lowest measure of similarity from the OGLE-II EB catalog. Shown here is only the OGLE I band is shown. 



Table 7. OGLE EB outliers 



Survey 


Type 


ID(RA-DEC) 


Plot COORD 


Period [days] 


Days of Obs 


Num Obs 


Interpretation 


OGLE-II 


EB 


052937.78-700903.4 


Al 


15.03314 


1239 


503 


Eccentric orbit. 


OGLE-II 


EB 


051915.79-693808.1 


A2 


8.03376 


1238 


432 


Eccentric orbit. 


OGLE-II 


EB 


051519.31-692640.3 


A3 


15.96256 


1235 


360 


Eccentric orbit. 


OGLE-II 


EB 


051858.34-693946.4 


Bl 


2.29555 


1235 


473 


Eccentric orbit. 


OGLE-II 


EB 


051700.39-691813.8 


B2 


5.29129 


1235 


368 


Eccentric orbit. 


OGLE-II 


EB 


052521.32-694858.9 


B3 


4.12088 


1234 


500 


Eccentric orbit. 


OGLE-II 


EB 


051734.54-692736.5 


CI 


14.58252 


1234 


325 


Eccentric orbit. 


OGLE-II 


EB 


051657.87-690328.1 


C2 


5.66141 


1238 


365 


EB with reflection effect. 


OGLE-II 


EB 


050646.85-683700.4 


C3 


12.14988 


1420 


264 


Eccentric orbit. 
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5 FUTURE WORK 

This paper is not intended to study all possible methods for finding 
outliers in datasets of light-curves but rather to help demonstrate 
and hopefully convince others how an automatic method like this 
can be applied to facilitate the discovery of new, interesting variable 
objects. Special emphasis should be given to the choice of measure 
of similarity. An attempt to study this issue will be made in a second 
paper where we will study how to employ more than one measure 
of similarity. 

In this paper we have used particular preprocessing tools 
and we tweaked our preprocessing steps for each catalog. We are 
planning a full released of the software which will include many 
preprocessing options and optimized algorithms as a downloadable 
software and as an on-line tool and web services in the near future 
jhttp : / / darwin . cf a . harvard ■ edu/ Light Curves/ s/) . 



6 CONCLUSIONS 

In this paper we presented a methodology based on cross- 
correlation as a measure of similarity that enables us to discover 
outliers in catalogs of periodic light-curves. We established the 
methodology in Fourier space and extended the cross-correlation 
to accommodate observational errors. 

The results from the application of our method on catalogs 
of classified periodic stars from the MACHO and OGLE projects 
are encouraging, and establish that our method correctly identifies 
light-curves that do not belong to these catalogs as outliers. 

We have identified light-curves that were simply misclassified, 
light-curves that were folded with the wrong period and so appear 
different, and light-curves that emerged as unique. 

We show how with careful approximations our method 
can be applied to very large catalogs thus making it a 
useful tool for the upcoming new surveys Pan-STARRS 
jhttp : / /pan-starrs . if a . hawaii ■ edvj and LSST 
jhttp : / /www . lsst . org) . 

We have nonetheless also concluded that a single measure of 
similarity is not adequate to capture all features for all types of 
light-curves and we understand that an extension of our method 
that utilizes more measures (comparison of Fourier components, 
wavelet coefficients etc) or combinations of measures has to be car- 
ried out; these will be presented in a future paper. 

It is worth mentioning th at other works per forming automated 
classification of light-curves I B rett et all 120041) ) can also, in prin- 
ciple, find outliers. However since their focus is classification there 
is no guarantee that an outlier will be identified. This is because a 
light-curve must be clearly decoupled from all clusters in order to 
be considered as an outlier where in our case, since we do not have 
clusters, any light-curve can be classified as an outlier. This distinc- 
tion is important in order to appreciate the advantage of our method. 
Moreover a classification method cannot scale as N whereas our 
method can do so in some approximation schemes. 

We would like to make one last point. The situation of datasets 
that are not fully processed is going to become more common as the 
larger surveys come on-line. In the near future it will become nearly 
impossible to fully "clean" datasets without the use of automated 
methods such as the one presented here. We believe we have shown 
that our method has great utility at a number of steps along the 
processing pipeline. 
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APPENDIX A: CONVOLUTION IN FOURIER SPACE 

Let x(n) and y(n) be arbitrary functions of discrete time n with 
Fourier transforms. Take 

JV-l 

x{n)=^[X(v)](n) = 1 ]T X(y)e^ N (Al) 



n = 
JV-l 



-2iriisn/N 



V{n) = F- 1 [y{v)] (n) = jrJ2 y ^ e 



n = 
JV-l 



2niun/N 



, (A2) 



(A3) 



n = 
JV-l 



e 2 ™ n ' N , (A4) 



where X and $ are the complex conjugates Fourier transforms and 
T~ 1 (n) is the inverse Fourier transform. The correlation given a 
time lag r 



r ly{ T ) = E x ( n )y( n - r )> 



(A5) 



= T- 1 [y[y)X{y)] (r) . (A6) 



APPENDIX B: ERROR PROPAGATION 

The SG smoothing can be written as a simple linear sum over 
neighboring points 



E 



(Bl) 



where the coefficients d are the difficult thing to deduce, but have 
no errors in them (they do not depend on the data). The error in the 
smoothed value is then given by, 



(B2) 




(B3) 



To get the value of a measurement y for a given x using linear 
interpolation between the two points (xi , yi) and (x2 , 2/2) we have, 



y = VV2 + (r) - l)yi , 
where r\ is defined as: 

x — Xl 

V = • 

X2 — Xl 

Using the rules of error propagation, 




9y \ 2 . ( dy 



r, yi 1 'In 

dyi J \dy 2 
calculating the derivatives we find, 
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Similarly we can estimate the errors for the running averages where 
the running averages are: 



y= E 



(y-v y r 
e 2^ yi , 



(B8) 



i£ window 

where lo is the window size. Estimating the derivatives we get 

(y-yj) 



a l= E 



iG window 



1 - 



Vi (y - Vi) 
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